Forecasting Avocado Prices (ARIMA vs Prophet)

Problem Statement:

ABC Farms, a Mexico-based company, grows a variety of avocados that are sold in the US. They have had good success for the past several years and want to expand. To that end, they want to build and assess a plausible model to predict the average price of Hass avocados, in order to evaluate expanding into the different types of avocado farms available in other regions.

Business Goal: Forecast the prices of Avocado in the US

Data Source: Retailers’ cash registers

Target: Since this is a regression task, choose the model with the higher $R^2$ score and/or the lowest MAPE. These are the data-related metrics we will use.

Result: The prices needn't be forecast live, so there is no need to deploy this model. Based on the forecast prices, stakeholders can decide whether to expand their business into different types of avocado farms. They can also examine sales and prices in each state and plan their strategy accordingly.

Data

The data comes directly from retailers’ cash registers based on the actual retail sales of Hass avocados.

Some relevant columns in the dataset:

As mentioned above there are two types of avocados in the dataset as well as several different regions represented. This allows you to do all sorts of analysis for different areas of the United States, specific cities, or just the overall United States on either type of avocado. Our analysis will be focused on the complete dataset.

Let's code!

Understanding the data

Let’s first look at the columns and the data

EDA

We see two categorical variables, type and region. Let’s examine them.

For each year present in the dataset, the total counts of the two avocado types are almost equal.

Also, it is interesting to see that 2017 stands out as the year with the highest avocado prices, with 2015 in second place.

Plotting Histogram:

For this histogram we used a bin size of 20, but any bin size suited to the data can be used.

The skew of normally distributed data is very close to zero, so we apply the log1p transform to bring the skew of the price distribution closer to zero.
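As a sketch of this transform (using synthetic right-skewed prices as a stand-in for the AveragePrice column):

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed prices standing in for the AveragePrice column
prices = pd.Series(np.random.default_rng(0).lognormal(mean=0.3, sigma=0.4, size=1000))

print("skew before:", prices.skew())      # positive (right-skewed)
log_prices = np.log1p(prices)             # log(1 + x) compresses the right tail
print("skew after :", log_prices.skew())  # much closer to zero
```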

So, on average organic avocados are more expensive than conventional.

Correlation Matrix: A correlation matrix is essentially a covariance matrix normalized by the variables' standard deviations. Correlation is a summary measure that describes the strength and direction of the linear (straight-line) association between two quantitative variables. It takes values between -1 (perfectly negative correlation) and +1 (perfectly positive correlation).

The larger the absolute value, the stronger the relationship, and vice versa.

We will use a seaborn heatmap to plot the correlation matrix, annotating each cell with its correlation value.
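A minimal sketch of such a heatmap, using a small synthetic dataframe whose column names merely mimic the avocado dataset:

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # non-interactive backend
import matplotlib.pyplot as plt
import seaborn as sns

rng = np.random.default_rng(1)
# Toy numeric columns standing in for the avocado dataset
df = pd.DataFrame({
    "AveragePrice": rng.normal(1.4, 0.3, 200),
    "Total Volume": rng.normal(8e5, 1e5, 200),
})
df["Total Bags"] = 0.3 * df["Total Volume"] + rng.normal(0, 1e4, 200)

corr = df.corr()  # pairwise Pearson correlations
sns.heatmap(corr, annot=True, cmap="coolwarm", vmin=-1, vmax=1)
plt.tight_layout()
plt.savefig("corr_heatmap.png")
```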

Feature Engineering

Train-test splitting

The train-test split procedure is used to estimate the performance of machine learning algorithms when they are used to make predictions on data not used to train the model.

It is a fast and easy procedure to perform, and its results allow you to compare the performance of machine learning algorithms on your predictive modeling problem. Although simple to use and interpret, there are times when the procedure should not be used, such as when you have a small dataset, or when additional configuration is required, such as classification on an imbalanced dataset.

Configuring the split:

The procedure has one main configuration parameter: the size of the train and test sets, most commonly expressed as a fraction between 0 and 1 for either the train or test dataset. For example, a training set of size 0.67 (67 percent) means the remaining 0.33 (33 percent) is assigned to the test set.

The scikit-learn library provides an implementation of the train-test split evaluation procedure via the train_test_split() function.

The function takes a loaded dataset as input and returns the dataset split into two subsets.
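A minimal example of train_test_split on toy data (the arrays here are just placeholders for the real feature matrix and target):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(50, 2)  # toy feature matrix
y = np.arange(50)                  # toy target

# 67% train / 33% test; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)

print(X_train.shape, X_test.shape)  # (33, 2) (17, 2)
```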

Converting categorical columns to numerical columns

There are many ways to convert categorical values into numerical values, each with its own trade-offs and impact on the feature set. Here, we focus on two main methods: Label Encoding and One-Hot Encoding. Both encoders are part of the scikit-learn library and are used to convert text or categorical data into the numerical data that models expect and perform better with.

Label Encoding

This approach is very simple and it involves converting each value in a column to a number.

Using category codes approach (non-sklearn):

This approach requires the category column to be of 'category' datatype. By default, a non-numerical column is of 'object' type. So we need to change type to 'category' before using this approach.
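A sketch of the category-codes approach on a toy region column (the region names are just illustrative):

```python
import pandas as pd

df = pd.DataFrame({"region": ["Albany", "NewYork", "DallasFtWorth", "Albany"]})

# An 'object' column must be cast to 'category' before .cat.codes is available
df["region"] = df["region"].astype("category")
df["region_code"] = df["region"].cat.codes  # one integer code per category

print(df)
```

By default pandas assigns codes in alphabetical order of the category labels.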

Though label encoding is straightforward, it has a disadvantage: depending on the data values and type of data, it induces a number-sequencing problem. The problem with the numbers is that they introduce a relation/comparison between categories. There is, of course, no ordering among regions, but looking at the numbers one might think the 'DallasFtWorth' region has higher precedence than the 'NewYork' region. The algorithm might misunderstand the data as having some kind of hierarchy/order 0 < 1 < 2 … and give more weight to 'DallasFtWorth' in its calculations than to 'NewYork'.

This ordering issue is addressed in another common alternative approach called 'One-Hot Encoding'.

One-Hot encoding for categorical variables with multiple levels

In this strategy, each category value is converted into a new column and assigned a 1 or 0 (notation for true/false) value to the column.

In older versions of scikit-learn, OneHotEncoder only accepted numerical categorical values, so any string values had to be label encoded before being one-hot encoded; recent versions accept string categories directly.

Adding the one-hot encoded columns to the dataframe and removing the original feature
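One way to sketch this step is with pandas' get_dummies (an alternative to scikit-learn's OneHotEncoder), on a toy dataframe:

```python
import pandas as pd

df = pd.DataFrame({
    "type": ["conventional", "organic", "organic"],
    "AveragePrice": [1.1, 1.8, 1.6],
})

# One column per category, with 1/0 (true/false) values
dummies = pd.get_dummies(df["type"], prefix="type")

# Add the encoded columns and remove the original feature
df = pd.concat([df.drop(columns="type"), dummies], axis=1)

print(df.columns.tolist())
```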

Though this approach eliminates the hierarchy/order issue, it has the downside of adding more columns to the dataset. The number of columns can expand greatly if a categorical column has many unique values. In our case it was manageable, but it gets challenging when encoding produces many columns.

Regression Modeling

A Pipeline sequentially applies a list of transforms and a final estimator. The intermediate steps of the pipeline must implement fit and transform methods, while the final estimator only needs to implement fit.

As the name suggests, the Pipeline class lets you chain multiple processing steps into a single scikit-learn estimator. A Pipeline has fit, predict and score methods just like any other estimator (e.g. LinearRegression). A pipeline can be valuable because it enforces the desired order of processing steps, creating a convenient workflow and helping to ensure reproducibility.

We will use StandardScaler, which subtracts the mean from each feature and then scales it to unit variance.
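A minimal sketch of such a pipeline on synthetic data (a linear target with a little noise):

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

# The scaler (fit/transform) feeds the final estimator (fit only)
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("model", LinearRegression()),
])
pipe.fit(X, y)
print("R^2:", pipe.score(X, y))
```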

Time Series Forecasting

Time series is a sequence of observations recorded at regular time intervals.

Depending on the frequency of observations, a time series may typically be hourly, daily, weekly, monthly, quarterly and annual. Sometimes, you might have seconds and minute-wise time series as well, like, number of clicks and user visits every minute etc.

Forecasting is the step where you want to predict the future values the series is going to take. Forecasting a time series (like demand and sales) is often of tremendous commercial value.

Time Series Components:

A useful abstraction for selecting forecasting methods is to break a time series down into systematic and unsystematic components. A given time series is thought to consist of three systematic components including level, trend, seasonality, and one non-systematic component called noise.

A series is thought to be an aggregate or combination of these four components. All series have a level and noise. The trend and seasonality components are optional.

It is helpful to think of the components as combining either additively or multiplicatively.

We will focus on a particular type of forecasting method called ARIMA modeling.

ARIMA

ARIMA, short for 'Auto Regressive Integrated Moving Average' is a forecasting algorithm that 'explains' a given time series based on its own past values, that is, its own lags and the lagged forecast errors, so that equation can be used to forecast future values.

Any 'non-seasonal' time series that exhibits patterns and is not a random white noise can be modeled with ARIMA models.

An ARIMA model is characterized by 3 terms: p, d, q:

The first step to build an ARIMA model is to make the time series stationary. Why?

Because, term 'Auto Regressive' in ARIMA means it is a linear regression model that uses its own lags as predictors. Linear regression models, as you know, work best when the predictors are not correlated and are independent of each other.

How to make a series stationary?

The most common approach is to difference it. That is, subtract the previous value from the current value. Sometimes, depending on the complexity of the series, more than one differencing may be needed.

d is the minimum number of differencing needed to make the series stationary. And if the time series is already stationary, then d = 0.

A pure Auto Regressive (AR only) model is one where Yt depends only on its own lags. And, p is the order of the 'Auto Regressive' (AR) term. It refers to the number of lags of Y to be used as predictors.

A pure Moving Average (MA only) model is one where Yt depends only on the lagged forecast errors. And, q is the order of the 'Moving Average' (MA) term. It refers to the number of lagged forecast errors that should go into the ARIMA Model.

ARIMA model in words:

Predicted $Y_t$ = Constant + linear combination of lags of $Y$ (up to $p$ lags) + linear combination of lagged forecast errors (up to $q$ lags):

$\hat{Y}_t = c + \phi_1 Y_{t-1} + \cdots + \phi_p Y_{t-p} + \theta_1 \epsilon_{t-1} + \cdots + \theta_q \epsilon_{t-q}$

Choose parameters for ARIMA

As we can see from the plot, avocado prices were relatively flat for 2015 and much of 2016. It wasn’t until the middle of 2016 that there were "large" swings in prices that were duplicated again in 2017.

Let's look at the components of our time series:

We will use additive model, which suggests that the components are added together as follows:

Yt = Level + Trend + Seasonality + Noise

An additive model is linear where changes over time are consistently made by the same amount.

Find order of differencing i.e. 'd':

The purpose of differencing is to make the time series stationary.

But we need to be careful not to over-difference the series: an over-differenced series may still be stationary, but the unnecessary differencing will affect the model parameters.

First, we will check stationarity using Augmented Dickey Fuller Test. The null hypothesis of the ADF test is that the time series is non-stationary. So, if the p-value of the test is less than the significance level (0.05) then you reject the null hypothesis and infer that the time series is indeed stationary.

Since the p-value is greater than the significance level, let's difference the series and see what the autocorrelation plot looks like.

From above plots we can see that the time series reaches stationarity with one order of differencing.

We can also use 'ndiffs' to estimate 'd'. It performs a test of stationarity for different levels of d to estimate the number of differences required to make a given time series stationary. It will select the maximum value of d for which the time series is judged stationary by the statistical test.

Find order of AR term i.e. 'p':

The next step is to identify if the model needs any AR terms. You can find out the required number of AR terms by inspecting the Partial Autocorrelation (PACF) plot. Partial autocorrelation can be imagined as the correlation between the series and its lag, after excluding the contributions from the intermediate lags.

Let's see how partial autocorrelation plots look like.

We can observe that, apart from lag 0 (which is always 1), the PACF values are well within the significance limit (blue region). So we will choose p as 0.

Find order of MA term i.e. 'q':

Just like how we looked at the PACF plot for the number of AR terms, you can look at the ACF plot for the number of MA terms. An MA term is technically, the error of the lagged forecast.

The ACF tells how many MA terms are required to remove any autocorrelation in the stationarized series.

Let’s see the autocorrelation plot of the differenced series.

The lags are well within the significance limit. So, we will choose q as 0.

Build ARIMA Model

Let’s plot the residuals to ensure there are no patterns (that is, look for constant mean and variance).

The residual errors seem fine with near zero mean and uniform variance.

Forecasting and Evaluation

Let's plot the actuals against the fitted values using plot_predict() (in recent statsmodels versions it lives in statsmodels.graphics.tsaplots). When dynamic=False, the in-sample lagged values are used for prediction.

Around 2.9% MAPE implies the model is 97.1% accurate in making predictions.
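MAPE itself can be computed in a few lines of numpy (the numbers below are illustrative, not the notebook's actual results):

```python
import numpy as np

def mape(actual, forecast):
    """Mean Absolute Percentage Error, in percent."""
    actual, forecast = np.asarray(actual), np.asarray(forecast)
    return np.mean(np.abs((actual - forecast) / actual)) * 100

actual = np.array([1.40, 1.35, 1.50, 1.45])
forecast = np.array([1.38, 1.40, 1.46, 1.44])
print(f"MAPE: {mape(actual, forecast):.2f}%")
```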

The problem with plain ARIMA model is it does not support seasonality.

SARIMA

If your time series has defined seasonality, then, go for SARIMA which uses seasonal differencing.

Seasonal differencing is similar to regular differencing, but, instead of subtracting consecutive terms, you subtract the value from previous season.

So, the model is represented as SARIMA(p,d,q)x(P,D,Q)m, where P, D and Q are the seasonal AR order, the order of seasonal differencing and the seasonal MA order respectively, and m is the seasonal period (frequency) of the time series.

If your model has well-defined seasonal patterns, then enforce D=1 for a given seasonal period m.

Forecasting using Facebook Prophet

Install Prophet using either command prompt or Anaconda prompt using pip:

pip install prophet

(Older releases were published under the package name fbprophet.)

We need to install plotly for plotting the data for prophet

pip install plotly

Prophet is an open-source library designed for making forecasts for univariate (single-variable) time series datasets. It is easy to use and designed to automatically find a good set of hyperparameters, in an effort to make skillful forecasts by default for data with trends and seasonal structure. It was initially developed for creating high-quality business forecasts, helping businesses understand and possibly predict the market. The library tries to address difficulties common to many business time series, such as multiple seasonalities, holiday effects, outliers, missing data and trend changes.

It is based on a decomposable additive model where non-linear trends are fit with seasonality, it also takes into account the effects of holidays. Before we head right into coding, let’s learn certain terms that are required to understand this.

Trend: The trend shows the tendency of the data to increase or decrease over a long period of time and it filters out the seasonal variations.

Seasonality: Seasonality is the variations that occur over a short period of time and is not prominent enough to be called a “trend”.

Understanding the Prophet Model: The general idea of the model is similar to a generalized additive model. The “Prophet Equation” fits, as mentioned above, trend, seasonality and holidays. This is given by,

y(t) = g(t)+s(t)+h(t)+e(t)

where g(t) models the trend, s(t) the seasonality, h(t) the effects of holidays, and e(t) the error term.

Input to Prophet is a dataframe which must have a specific format. The first column must have the name 'ds' while the second column must have the name 'y'.

ds is the datestamp column and should follow the pandas datetime format: YYYY-MM-DD for a date, or YYYY-MM-DD HH:MM:SS for a timestamp.

y is the numeric column we want to predict or forecast.

This means we change the column names in the dataset. It also requires that the first column be converted to date-time objects, if they are not already.
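A sketch of this renaming/conversion step, assuming columns named Date and AveragePrice as in the avocado dataset:

```python
import pandas as pd

# Toy slice of the avocado data (columns assumed: Date, AveragePrice)
df = pd.DataFrame({
    "Date": ["2015-01-04", "2015-01-11", "2015-01-18"],
    "AveragePrice": [1.33, 1.35, 1.30],
})

# Prophet requires exactly these column names and a datetime 'ds'
prophet_df = df.rename(columns={"Date": "ds", "AveragePrice": "y"})
prophet_df["ds"] = pd.to_datetime(prophet_df["ds"])

print(prophet_df.dtypes)
```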

Fit Prophet Model

Prophet's API is very similar to the one you can find in sklearn. To use Prophet for forecasting, first, a Prophet() object is defined and configured, then it is fit on the dataset by calling the fit() function and passing the data, and, finally, make a forecast.

The Prophet() object takes arguments to configure the type of model you want, such as the type of growth, the type of seasonality, and more. By default, the model will work hard to figure out almost everything automatically.

The fit() function takes a DataFrame of time series data.

Now we need to create a new Prophet object. Here we can pass the parameters of the model into the constructor; initially, we will use the defaults. Then we train our model by invoking its fit method on our training dataset:

Forecasting

A forecast is made by calling the predict() function and passing a DataFrame that contains one column named 'ds' and rows with date-times for all the intervals to be predicted.

Using the helper method Prophet.make_future_dataframe, we create a dataframe which will contain all dates from the history and also extend into the future for those 50 weeks that we left out before.

The result of the predict() function is a DataFrame that contains many columns. Perhaps the most important columns are the forecast date time ('ds'), the forecasted value ('yhat'), and the lower and upper bounds on the predicted value ('yhat_lower' and 'yhat_upper') that provide uncertainty of the forecast.

Forecast quality evaluation

MAPE is widely used as a measure of prediction accuracy because it expresses error as a percentage and thus can be used in model evaluations on different datasets.

In addition, when evaluating a forecasting algorithm, it may prove useful to calculate MAE (Mean Absolute Error) in order to have a picture of errors in absolute numbers.

Visualization

The Prophet library has its own built-in tools for visualization that enable us to quickly evaluate the result.

First, there is a method called Prophet.plot that will create a plot of the dataset and overlay the prediction with the upper and lower bounds for the forecast dates:

The second function Prophet.plot_components might be much more useful in our case. It allows us to observe different components of the model separately: trend, yearly and weekly seasonality. In addition, if you supply information about holidays and events to your model, they will also be shown in this plot.

The above plots show the trend and the yearly seasonality of the time series. There is first a decreasing and then an increasing trend, meaning the price of avocados initially dipped and then increased over time. Looking at the seasonality plot, September through November have the highest prices in a given year.

Adding ChangePoints to Prophet

Changepoints are the datetime points where the time series has abrupt changes in its trajectory.

By default, Prophet adds 25 changepoints in the initial 80% of the dataset. The number of changepoints can be set using the n_changepoints parameter when initializing Prophet (e.g., model = Prophet(n_changepoints=30)).

Let’s plot the vertical lines where the potential changepoints occurred

Adjusting Trend

Prophet allows you to adjust the trend in case there is overfitting (too much flexibility) or underfitting (not enough flexibility).

changepoint_prior_scale helps adjust the strength of the trend. The default value for changepoint_prior_scale is 0.05.

Decrease the value to make the trend less flexible. Increase the value of changepoint_prior_scale to make the trend more flexible.

Adding Holidays

Holidays and events can cause changes in a time series. In our example, National Avocado Day on July 31 and National Guacamole Day on September 16 can impact avocado prices.

We can create a custom holiday list for Prophet by building a dataframe with two columns, ds and holiday, containing a row for each occurrence of each holiday.

lower_window and upper_window extend the holiday effect to days around the date. If we want to include the day prior to National Avocado Day and Guacamole Day, we set lower_window=-1 and upper_window=0.

If we wanted to include the day after the holiday instead, we would set lower_window=0 and upper_window=1.
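The holiday dataframe can be sketched as follows (the years shown are illustrative):

```python
import pandas as pd

# One row per occurrence of each holiday across the years in the data
avocado_day = pd.DataFrame({
    "holiday": "national_avocado_day",
    "ds": pd.to_datetime(["2015-07-31", "2016-07-31", "2017-07-31"]),
    "lower_window": -1,  # also include the day before
    "upper_window": 0,
})
guacamole_day = pd.DataFrame({
    "holiday": "guacamole_day",
    "ds": pd.to_datetime(["2015-09-16", "2016-09-16", "2017-09-16"]),
    "lower_window": -1,
    "upper_window": 0,
})
holidays = pd.concat([avocado_day, guacamole_day], ignore_index=True)
print(holidays)
# Passed to the model as Prophet(holidays=holidays)
```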

Adding Multiple Regressors

Additional regressors can be added to the Prophet model using add_regressor. The additional regressor's column must be present in both the fitting and the prediction dataframes.